2018-01-12 / FMA sub-sampling

  • Problem statement:

    • Input:

      • C CSV files
      • each file has n rows; each row in file c encodes the prediction for class c on a 1 sec segment
      • A target number k
      • Target fractions for class representations p[c].
    • Output:

      • A set of k clips, each 10 seconds in duration
      • Aggregate predicted likelihoods for each class c on each of the k clips
      • Each class c has total aggregate likelihood of at least p[c] * k over the selected clips
  • Method:
    1. drop edge effects from the beginning and end of tracks: remove the first and last frames from each track.
    2. window the frame observations into 10sec clips with aggregate labels
    3. threshold the aggregate likelihoods to binarize the representation
    4. subsample the 10sec clips using entrofy
  • Questions:
    • How should likelihoods be aggregated within a segment?
      • Mean? Max? Quartile?
      • Mean makes sense from the perspective of random frame sampling
      • Quartile makes sense wrt sparse events
      • Max makes sense wrt extremely sparse events
    • How should likelihoods be thresholded? 0.5? Empirical average over X?
      • $p[y] = \sum_x p[y \mid x]\, p[x] \approx \frac{1}{|X|} \sum_{x \in X} p[y \mid x]$
      • But that doesn't really matter: the threshold should be Bayes-optimal (i.e., 0.5)
    • What's the target number of positives per class, k * p[c]?
      • Maybe that should be determined by the base rate estimation p[y]?
  • Next step: Question scheduling on CF.
    • Idea: cluster the tracks according to their aggregated likelihood vectors (a rough sketch follows this entry)
      • Or maybe by their thresholded likelihoods?
    • Set the number of clusters to be relatively large (say, 23² = 529, roughly 512)
    • When generating questions for an annotator, assign them to a cluster and only generate questions from that cluster
    • Reasoning: this will keep the labels consistent from one question to the next
  • UPDATE:
    • Windowing and aggregation are happening upstream of this
    • Aggregation is max over the middle 8 frames (see the sketch below)
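
  A minimal sketch of that upstream windowing/aggregation, for reference only (hypothetical
  aggregate_clips helper; swapping .max for .mean or a quantile gives the other aggregators
  discussed above):

    import pandas as pd

    def aggregate_clips(frames, clip_len=10):
        # frames: one row per 1 sec frame (in temporal order), one column per
        # class, for a single track.  Track-level edge frames are assumed to
        # have been dropped already (method step 1).
        clips = []
        index = []
        for start in range(0, len(frames) - clip_len + 1, clip_len):
            window = frames.iloc[start:start + clip_len]
            # aggregate as the max over the middle clip_len - 2 frames (8 of 10)
            clips.append(window.iloc[1:-1].max(axis=0))
            index.append(frames.index[start])
        return pd.DataFrame(clips, index=index)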

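  The cluster-based question scheduling above is only an idea at this point; a rough sketch
  (assumes the clip-by-class likelihood table df loaded in the notebook below, with fragment
  ids of the form trackid_clipindex, e.g. '000002_0013'):

    import pandas as pd
    from sklearn.cluster import KMeans

    # One likelihood vector per track: average the clip-level likelihoods over
    # all clips belonging to the same track.
    track_ids = df.index.str.split('_').str[0]
    track_vectors = df.groupby(track_ids).mean()

    # Relatively large number of clusters, per the note above.
    km = KMeans(n_clusters=512, random_state=20180112)
    clusters = pd.Series(km.fit_predict(track_vectors), index=track_vectors.index)

    # A question batch for one annotator would then draw only from tracks in a
    # single cluster, e.g. clusters[clusters == 3].index
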
2018-01-19

  • Eric has provided the per-fragment aggregated estimates as one giant table
  • So what are our entrofy parameters?

    • attribute thresholds
      • Do we only split below/above 0.5?
      • Or break likelihood into quartiles?
      • Sounds like quartiles are the way to go
    • target proportions per class?
      • we can try to preserve the empirical distribution
      • or a biased distribution achieved by grouping on the track ids?
      • or uniform?
      • Uniform across quartiles for each instrument
    • output set size?
      • 20-50 positives per instrument?
      • say, 16 * 4 * n_classes
      • Maybe round up to 1K to start
  • If we only want one example per track, we can make an aux categorical column that's the track index, and set the target number to 1 (sketched below)
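
  One way to realize the one-example-per-track constraint (a sketch only; the track id is
  the prefix of the fragment index, e.g. '000002_0013' -> '000002'):

    # Auxiliary categorical column carrying the track id of each fragment.
    df_aux = df.copy()
    df_aux['track_id'] = df_aux.index.str.split('_').str[0]

    # Alternatively, enforce at most one selected clip per track after the fact,
    # keeping the first selected fragment per track (idx is the entrofy output).
    selected = df.loc[idx]
    one_per_track = selected.groupby(selected.index.str.split('_').str[0]).head(1)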

2018-02-02

  • Turns out we didn't get the data transferred in time on 01/19, so still waiting
  • output set size: 500-1000 positives per class
  • try both hard threshold and quartile sampling

In [1]:
import numpy as np
import pandas as pd
import entrofy

In [2]:
import matplotlib.pyplot as plt

In [3]:
%matplotlib nbagg

In [50]:
# Clip-level likelihoods: max over the middle 8 frames of each 10 sec clip (see notes above)
df = pd.read_csv('/home/bmcfee/data/vggish-likelihoods-a226b3-maxagg10.csv.gz', index_col=0)

In [51]:
df.head(5)


Out[51]:
accordion bagpipes banjo bass cello clarinet cymbals drums flute guitar ... mandolin organ piano saxophone synthesizer trombone trumpet ukulele violin voice
000002_0000 0.01542 0.008608 0.010215 0.035007 0.008873 0.00893 0.086853 0.671350 0.021807 0.135010 ... 0.006079 0.011073 0.084341 0.015115 0.781432 0.012166 0.025021 0.044818 0.067646 0.999691
000002_0001 0.01542 0.008608 0.010215 0.076214 0.008873 0.00893 0.086853 0.630533 0.021807 0.244505 ... 0.006079 0.011073 0.084341 0.015115 0.781432 0.012166 0.025021 0.044818 0.067646 0.999691
000002_0002 0.01542 0.008608 0.010215 0.076214 0.008873 0.00893 0.089177 0.858667 0.021807 0.244505 ... 0.006079 0.011073 0.084341 0.015115 0.188291 0.012166 0.025021 0.044818 0.067646 0.999691
000002_0003 0.01542 0.008608 0.010215 0.076214 0.004974 0.00893 0.089177 0.858667 0.012667 0.244505 ... 0.003388 0.009051 0.040380 0.009120 0.131694 0.005950 0.014247 0.044818 0.067646 0.999691
000002_0004 0.01542 0.008608 0.009334 0.076214 0.004974 0.00893 0.089177 0.858667 0.012667 0.244505 ... 0.003388 0.017866 0.078745 0.009120 0.204007 0.005950 0.014247 0.028634 0.088025 0.999691

5 rows × 23 columns


In [54]:
# Per-class tallies after thresholding at 0.5 ('top' is the majority value, 'freq' its count)
(df >= 0.5).describe().T.sort_values('freq')


Out[54]:
count unique top freq
guitar 29620525 2 True 16736215
drums 29620525 2 True 19005563
voice 29620525 2 True 20766522
synthesizer 29620525 2 False 21567964
violin 29620525 2 False 27651523
piano 29620525 2 False 28699240
mallet_percussion 29620525 2 False 29007329
flute 29620525 2 False 29105788
bass 29620525 2 False 29150189
cello 29620525 2 False 29165508
saxophone 29620525 2 False 29211328
organ 29620525 2 False 29290712
accordion 29620525 2 False 29428347
harmonica 29620525 2 False 29451312
bagpipes 29620525 2 False 29452449
trumpet 29620525 2 False 29461391
trombone 29620525 2 False 29468003
cymbals 29620525 2 False 29489953
ukulele 29620525 2 False 29493362
harp 29620525 2 False 29556563
banjo 29620525 2 False 29563360
mandolin 29620525 2 False 29587198
clarinet 29620525 2 False 29598953

In [71]:
df.median()


Out[71]:
accordion            0.007389
bagpipes             0.006224
banjo                0.004509
bass                 0.155189
cello                0.016092
clarinet             0.006047
cymbals              0.078683
drums                0.647003
flute                0.022228
guitar               0.569506
harmonica            0.008434
harp                 0.005587
mallet_percussion    0.044757
mandolin             0.004615
organ                0.033265
piano                0.119241
saxophone            0.019062
synthesizer          0.292279
trombone             0.012580
trumpet              0.015803
ukulele              0.011841
violin               0.086547
voice                0.819313
dtype: float64

Binary thresholding


In [55]:
# Target subsample size: roughly 100 clips per class, 23 classes
N_OUT = 23 * 100

In [56]:
# One binary (below/above 0.5) mapper per class column
mappers = {col: entrofy.mappers.ContinuousMapper(df[col],
                                                 prefix=col,
                                                 n_out=2,
                                                 boundaries=[0.0, 0.5, 1.0]) for col in df}

In [ ]:
# Select N_OUT clips against the binarized class targets
idx, score = entrofy.entrofy(df, N_OUT, mappers=mappers,
                             seed=20180205,
                             quantile=0.05,
                             n_trials=10)

In [64]:
df.loc[idx].head(10)


Out[64]:
accordion bagpipes banjo bass cello clarinet cymbals drums flute guitar ... mandolin organ piano saxophone synthesizer trombone trumpet ukulele violin voice
000046_0053 0.062583 0.915700 0.018637 0.162992 0.040727 0.018444 0.100536 0.534406 0.033820 0.976719 ... 0.010782 0.714110 0.300557 0.087529 0.385306 0.079978 0.075567 0.030833 0.160921 0.794626
000311_0001 0.006875 0.001383 0.002581 0.696345 0.006392 0.003727 0.033844 0.651003 0.023251 0.955172 ... 0.003475 0.034006 0.156438 0.015048 0.539802 0.004837 0.006913 0.021388 0.040315 0.907296
000341_0068 0.024864 0.001311 0.028881 0.163791 0.020460 0.008838 0.041862 0.584283 0.018548 0.609002 ... 0.026970 0.123292 0.663510 0.030823 0.886328 0.012027 0.011733 0.251555 0.057407 0.970078
000368_0158 0.010230 0.005692 0.028237 0.502167 0.014982 0.004697 0.082778 0.646268 0.029445 0.822505 ... 0.018546 0.076291 0.458668 0.025077 0.938266 0.006261 0.009553 0.034601 0.047432 0.670003
000402_0003 0.027663 0.014083 0.009137 0.107472 0.176484 0.018028 0.028532 0.300647 0.058136 0.271525 ... 0.004556 0.554254 0.270120 0.031469 0.611561 0.038181 0.039272 0.010984 0.643716 0.946539
000644_0672 0.012434 0.014400 0.018587 0.270604 0.256813 0.067379 0.045278 0.427513 0.132918 0.987838 ... 0.018609 0.162592 0.245520 0.445257 0.072400 0.703359 0.640719 0.121939 0.180639 0.834285
001023_0083 0.013796 0.005061 0.003343 0.416631 0.460000 0.835826 0.005285 0.134761 0.307062 0.477020 ... 0.004188 0.093485 0.137888 0.544438 0.066581 0.091138 0.089995 0.009233 0.199989 0.549448
001033_0024 0.023249 0.012236 0.011249 0.184627 0.028087 0.071508 0.116394 0.717136 0.437962 0.626129 ... 0.010670 0.071872 0.229724 0.039138 0.511643 0.013548 0.028309 0.046583 0.179872 0.632515
001145_0137 0.003957 0.000918 0.000558 0.203969 0.146485 0.006395 0.010076 0.124820 0.023060 0.859799 ... 0.000823 0.567757 0.338795 0.018945 0.538661 0.009014 0.007018 0.002454 0.139204 0.894864
001214_1090 0.001480 0.000647 0.000639 0.587898 0.009738 0.002642 0.139987 0.929737 0.004773 0.968694 ... 0.000823 0.011013 0.118221 0.004195 0.743730 0.003819 0.004699 0.002008 0.013668 0.075813

10 rows × 23 columns


In [65]:
# Thresholded class counts within the selected subsample
(df.loc[idx] >= 0.5).describe().T.sort_values('freq')


Out[65]:
count unique top freq
guitar 2300 2 True 1153
voice 2300 2 False 1177
drums 2300 2 False 1325
synthesizer 2300 2 False 1499
violin 2300 2 False 1526
piano 2300 2 False 2008
cello 2300 2 False 2011
mallet_percussion 2300 2 False 2018
flute 2300 2 False 2023
saxophone 2300 2 False 2032
bass 2300 2 False 2036
trumpet 2300 2 False 2069
accordion 2300 2 False 2077
organ 2300 2 False 2083
harmonica 2300 2 False 2091
trombone 2300 2 False 2093
bagpipes 2300 2 False 2111
ukulele 2300 2 False 2132
cymbals 2300 2 False 2137
banjo 2300 2 False 2208
harp 2300 2 False 2211
mandolin 2300 2 False 2239
clarinet 2300 2 False 2271


In [69]:
!pwd


/home/bmcfee/git/cosmir/dev-set-builder/notebooks

In [68]:
idx.to_series().to_json('subsample_idx.json')

Multi-valued thresholds


In [ ]:
# Quartile bins per class (boundaries at 0, 0.25, 0.5, 0.75, 1.0)
mappers = {col: entrofy.mappers.ContinuousMapper(df[col],
                                                 prefix=col,
                                                 n_out=4,
                                                 boundaries=[0.0, 0.25, 0.5, 0.75, 1.0]) for col in df}

In [3]:
# Quartile-balanced run: select 1000 clips against the four-bin targets
idx, score = entrofy.entrofy(df, 1000, mappers=mappers, n_trials=100)
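
Once the quartile run finishes, a quick balance check (a sketch; the pd.cut bins mirror
the mapper boundaries above):

    # Count how many selected clips land in each likelihood quartile bin, per class.
    bins = [0.0, 0.25, 0.5, 0.75, 1.0]
    subsample = df.loc[idx]
    subsample.apply(lambda col: pd.cut(col, bins=bins, include_lowest=True).value_counts())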